PM2.5 Forecast in Korea using the Long Short-Term Memory (LSTM) Model

The National Institute of Environmental Research, under the Ministry of Environment of Korea, provides two-day forecasts, through AirKorea, of the concentration of particulate matter with diameters of ≤ 2.5 μm (PM2.5) in terms of four grades (low, moderate, high, and very high) over 19 districts nationwide. Particulate grades are subjectively designated by human forecasters based on forecast results from the Community Multiscale Air Quality (CMAQ) and artificial intelligence (AI) models in conjunction with weather patterns. This study evaluates forecasts from the long short-term memory (LSTM) algorithm relative to those from CMAQ-solely and AirKorea using observations from 2019. The skills of the one-day PM2.5 forecasts over the 19 districts were 39–70% for CMAQ, 72﻿–79% for LSTM, and 73﻿–80% for AirKorea; the AI forecasts showed comparable skills to the human forecasters at AirKorea. The one-day forecast skill levels of high and very high PM2.5 pollution grades are 31﻿–98%, 31﻿–74%, and 39﻿–81% for the CMAQ-solely, the LSTM, and the AirKorea forecasts, respectively. Despite good skills for forecasting the high and very high events, CMAQ-solely forecasts also generate substantially higher false alarm rates (up to 86%) than the LSTM and AirKorea forecasts (up to 58%). Hence, applying only the LSTM model to the CMAQ forecasts can yield reasonable forecast skill levels comparable to the operational AirKorea forecasts that elaborately combine the CMAQ model, AI models, and human forecasters. The present results suggest that applications of appropriate AI models can greatly enhance PM2.5 forecast skills for Korea in a more objective way. Supplementary Information The online version contains supplementary material available at 10.1007/s13143-022-00293-2.


Introduction
As societies have become more technologically advanced, energy demands have increased to result in deterioration of air quality (AQ). To meet the clean-air mandate, the Korean government has implemented numerous policies to reduce the emission of air pollutants since the early 1980s (Kim and lee 2018;Trnka 2020).
While the decrease of air pollutants concentration has been negligible since 2015 , the concentration of particulate matter (PM) with diameters ≤ 2.5 μm (PM 2.5 ) was found to have decreased significantly from the 2019 winter to the 2021 (NIER 2021; posts dated January 10, 2022, in the press release on http:// eng. me. go. kr/). The large reduction in the PM 2.5 was most likely due to the social distancing precautions that were taken during the coronavirus disease (COVID-19) pandemic period; when COVID-19 protocols subsided, PM 2.5 returned to the previous concentrations (Ju et al. 2021;Kwak et al. 2021). This is a great concern for the public health in Korea which has suffered from the phenomenon 'Sam-Han-Sa-Mi', which translates to the alternation between three cold and four polluted days.
The atmospheric PM concentrations are determined by multiple factors such as pollutants emissions, transboundary PM transports, ventilation, and scavenging (e.g., Choi et al. 2008Choi et al. , 2019Lee et al. 2011;Oh et al. 2020;Chang et al. 2021). Unless there are certain events or strict reduction measures that reduce pollutant emissions, PM 2.5 concentrations are expected to remain at elevated concentrations in the future. The National Institute of Environmental Research (NIER), under the Ministry of Environment of Korea, has been forecasting PM 2.5 levels in four grades (low, moderate, high, and very high) over 19 districts nationwide since February 2 of 2015 through a national AQ forecast system, AirKorea (www. airko rea. or. kr) to help people to prepare for deteriorating AQ. The Air-Korea system was organized with numerical modeling of Weather Research and Forecasting-Community Multiscale Air Quality (WRF-CMAQ) and subjective decision making by forecasters . Forecasters provide the PM 2.5 grade forecasts for two-day lead-times based on the current weather conditions, the movement of air pollutants, and the CMAQ results. While several different types of artificial intelligence (AI) algorithms were introduced to improve the PM forecasting, results of these attempts were known to be challenging to interpret (c.f., Kim et al. 2022). Therefore, the use of the AI model requires experts with domain knowledge instead of its excessive trust (Zhang et al. 2021;Tjoa and Guan 2021).
In recent years, the application of AI in PM forecasts via including nonlinearity has significantly improved prediction accuracy (Wu and Lin 2019;Xayasouk et al. 2020). Incorporation of the weather and AQ variables from both observations and model forecasts into the AI-model training could reduce forecast errors considerably . While the long short-term memory (LSTM) algorithm is similar to a recurrent neural network (RNN) in that it learns a relationship among sequentially connected information, it compensates for the shortcomings of RNN by having a smaller effect on long-distance information (Hochreiter and Schmidhuber 1997;Gers et al. 2000).
The current LSTM-based PM 2.5 forecast model is an improved version of the previous RNN model developed by Ho et al. (2021). The LSTM model modulates the CMAQ forecasts by incorporating an AI learning algorithm. The forecast skill of the LSTM model was compared to the Air-Korea and CMAQ-solely forecasts. It is noted that the present CMAQ is slightly different from AirKorea's; the configurations of the two models are described in supplementary  Table S1. Section 2 describes the data and operational 19 forecast districts, and Section 3 explains the organization of the LSTM model. Section 4 presents a comparison of the prediction results of the LSTM, AirKorea, and CMAQ models. Section 5 summarizes the results of this study.

Data and Forecast Districts
The inputs for the LSTM model were obtained from both observations and two model simulations (CMAQ and WRF), as described in Sections 2.1 and 2.2, respectively.

Observed AQ and Meteorological Variables
There are 260 AQ stations across Korea, mostly in and around megacities (denoted by open circles in Fig. 1). While Seoul is covered quite evenly by 25 stations in all administrative districts, there are often no observed stations in mountainous and/or rural regions, despite they cover substantially larger area than Seoul. Various 6-h mean values of AQ variables-PM of diameters ≤ 10 μm (PM 10 ), PM 2.5 , O 3 , SO 2 , NO 2 , and CO-were analyzed for the period 2015-21 (seven years; Table 1). Six-hour average meteorological variables (pressure, temperature, dew point temperature, relative humidity, and horizontal wind) at 97 automated surface observing systems (ASOS) across the nation (closed circles in Fig. 1) were also analyzed during the same seven years (Table 1).

Model Forecasted Results
The numerical AQ forecasts employ the CMAQ and WRF models for the simulations of AQ and meteorological data for driving CMAQ, respectively. There are four categories of the model results (Table 1). The first category includes the six CMAQ variables-PM 10 , PM 2.5 , O 3 , SO 2 , NO 2 , and CO-that were observed. The second is the WRF meteorological variables (geopotential height, temperature, relative humidity, and zonal, meridional, and vertical wind) at the surface and six atmospheric levels (1000, 925, 850, 700, 500, and 300 hPa). Third, cosine similarity and back-trajectory cluster values are calculated by processing meteorological variables from the WRF. The cosine similarity is an index representing the spatial similarity to the meteorological fields for observed high PM 2.5 concentration events (Hur et al. 2016), which are the same as the WRF meteorological variables at the same atmospheric level. Fourth, the backtrajectory cluster values are obtained by backtracking the air flow from the FLEXible PARTicle dispersion model (FLEX-PART, Stohl et al. 2005; see http:// flexp art. eu for details). FLEXPART has been widely utilized to identify the pathway of long trails of air pollutants. In this study, the air current flowing into the target district was tracked over the previous three days and the direction was indexed.

Operational Forecast for Four Grades in 19
Forecast Districts The CMAQ model forecasts PM 2.5 concentration values at all grids. However, NIER produces AQ forecasts over the 19 districts in terms of four AQ grades. The four grades in the NIER operational forecasts are defined in terms of the PM 2.5 , as follows: low (PM 2.5 ≤ 15 μg m −3 ), moderate (15 μ g m −3 < PM 2.5 ≤ 35 μg m −3 ), high (35 μg m −3 < PM 2.5 ≤ 75 μ g m −3 ), and very high (75 μg m −3 < PM 2.5 ). The 19 districts were determined by considering the spatial distribution of AQ stations, populations, administrative districts, and consistency of living areas (Fig. 1). Eight big cities (Busan, Daegu, Daejeon, Gwangju, Incheon, Sejong, Seoul, and Ulsan) are regarded as independent districts. Sejong was selected as the second administrative capital, although its current population is around 380,889 (July 2022). Gyeonggi-do and Gangwon-do were divided into four districts: Gyeonggi North and South and Gangwon East and West, while each of the seven provinces (Chungcheong North and South, Gyeongsang North and South, Jeju Island, and Jeolla North and South) were designated as independent districts. Note that Jeju Island (do in Korean) is a special autonomous province, and is therefore denoted as Jeju-do.
Although the 19 districts are operationally independent, air pollutant concentrations varied similarly in geographically adjacent districts. For example, the daily anomalous PM 2.5 concentrations in the Gangwon West and Seoul metropolitan area (including Seoul, Incheon, and Gyeonggi North and South; SMA hereafter) are strongly correlated with each other, with correlation coefficients of over 0.7 for the seven-year analysis period (significant at the 99% confidence level). The 19 forecast districts were therefore rearranged into six broad regions based on the interregional correlation coefficients when establishing the AI forecast model (table not shown). The six broad forecast regions colored differently in Fig. 1 are Chungcheong-do, Gangwon East, Gyeongsang-do, Jeju-do, Jeolla-do, and SMA + Gangwon West.  Day−1) to T15 (21:00 LT on Day+2) in Fig. 2. The prediction time was at T5, which corresponds to 09:00 LT on Day0. The day forecast (Day0) was from T6-T7. T1-T3 corresponds to one day before Day0 (Day−1), T8-T11 to one day ahead (Day+1), and T12-T15 to two days ahead (Day+2). As explained in Sects. 2.1 and 2.2, AQ and meteorological variables used for LSTM input data were obtained from observations and model simulations.

Flow Chart of LSTM Model Process
In Stage 2, data preprocessing consisted of regional and seasonal grouping, bagging ensembles, and normalization. Operational forecasts are produced for the 19 districts and thereby for the six broad forecast regions comprised of neighboring districts (see Section 2.3). The primary AQ and weather variables in controlling PM 2.5 concentrations varied seasonally. For example, a higher temperature and weaker ventilation occurred during stagnant periods, resulting in the deterioration of AQ. This implied that temperature and PM 2.5 concentrations tended to be positively correlated, mostly in winter ). However, there was a weak positive correlation between temperature and PMs in the summer. To consider the temporal variations in the influencing variables, the LSTM model was trained for 12 groups of three consecutive months (i.e., January-February-March, February-March-April, …, and December-January-February), as conducted by Ho et al. (2021). The bootstrap aggregating (i.e., bagging) ensemble method was as follows: N learners were generated through N random samplings from a dataset excluding a test set, which thereby averaged the predicted N outputs (Breiman 1996). An aggregation of multiple learners complemented the weaknesses of individual models, which resulted in higher performance. Our LSTM model had 40 learners for each regional and seasonal model, resulting in a total of 2880 (= 40 × 6 × 12) models. The testing year was 2019, and the remaining years (2015-18, 20, and 21) were arbitrarily separated into training (80%) and validation (20%) periods. For testing and validation, all datasets were normalized in the range of 0-1 using the minimum and maximum values of the training set.
Stage 3 depicts the training and operational forecasting processes of the model. For example, to forecast the PM 2.5 concentrations in January in Seoul, 120 suitable models were selected from a total of 2880 models. In other words, 40 bagged ensembles for the November-December-January, December-January-February, and January-February-March models in the SMA + Gangwon West broad region were selected. Although our LSTM-based forecast system consisted of a large number of model sets, it was designed to enable a training process within 48 h (a few hours with two NVIDIA GeForce RTX 2080 Ti graphic cards). The detailed training procedures for the regional and seasonal models are described in the Appendix.

LSTM Algorithm
The RNN algorithm was designed to search for linkages among all the variables in a given time series. The crucial weakness of the RNN method is the vanishing gradient problem, where there is a rapid decrease in memory information during an increase in the time interval between the information and forecast time. LSTM was developed to solve this problem, and is represented by Eqs. 1-6 (Gers et al. 2000).
where I t , F t , G t , and O t denote the input, forget, cell, and output gates at the t-th time, respectively; W i, f, g, o and b i, f, g, o denote the weights and biases for each gate, respectively; C t and H t denote the cell and hidden states at the t-th time, respectively; σ denotes a logistic sigmoid function, and sin denotes a sinusoidal function (Sitzmann et al. 2020). In this study, the sinusoidal function was used instead of the hyperbolic tangent function, which is the basic activation function in a traditional LSTM cell. Figure 3 illustrates the flow of the cell and the hidden states inside the LSTM memory cell. Four gates (forget, input, cell, and output gates) played a role in adjusting state information. The cell state is a long-term memory device that encodes data over all time steps. The new cell state (C t ) in the present time step was updated by combining the two processes of determining which information was loaded from the past and stored in the present (Fig. 3a). First, the forget gate (F t ) removed trivial information about the cell state (C t-1 ) transferred from the previous time step (F t × C t-1 in Eq. 5). Second, the input gate (I t ) regulated the amount of information in the cell gate (G t ) (I t × G t in Eq. 5). The output gate (O t ) produced the necessary information from the activated cell state at each time step (Fig. 3b), which is called the hidden state (H t ; Eq. 6). The hidden state is not the final output, but is encoded information that focuses more on the present time step, unlike the cell state. The internal process of the LSTM memory cell described above was performed for each time step. (1)

Evaluation Metrics
The CMAQ-solely, CMAQ-LSTM (LSTM hereafter), and AirKorea forecasts were evaluated against in situ observations. While observations, CMAQ, and LSTM provide PM 2.5 mass concentrations (in μg m −3 ), the AirKorea forecasters designate AQ in four grades. For consistency with AirKorea forecasts, the PM 2.5 mass concentrations values were converted into the four grades. The evaluation metrics included the root-mean-square error (RMSE), accuracy, probability of detection (POD), false alarm rate (FAR), receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Because RMSE is calculated using the observed and model concentration differences (Eq. 7), it is not applicable to the evaluations of AirKorea forecasts. Accuracy represents the hit rate within the four grades (Eq. 8). POD denotes the rate at which the model detects both high and very high grades of public interest. As shown in Eq. 9, POD is the ratio of forecasts to observations for those two grades. The FAR is the rate of incorrect forecasts-for example, the high and very high grades against the observed low and moderate grades (Eq. 10).
The ROC curve represents the binary classification performance of alert (≥ a certain threshold) and non-alert (< a certain threshold) events, which are expressed in two axes: the true positive rate (TPR) on the y-axis versus the false positive rate (FPR) on the x-axis (Fawcett 2006). TPR is the probability of correctly forecasting the observed alert events (Eq. 11). The FPR is the probability of falsely rejecting nonalert events (Eq. 12). The alert (non-alert) refers to the days on which the forecasted or observed PM 2.5 concentrations are above (below) the threshold value. If the threshold is 35 μg m −3 , the alert is equal to high and very high grades, so POD and TPR display the same result. The AUC is a quantification of the ROC curve area (Eq. 13). As the point (FPR, TPR) approaches (1, 1) for any threshold in the ROC curve, the PM 2.5 threshold becomes smaller. Therefore, most observations and forecasts become classified as alerts. On the other hand, as point (FPR, TPR) approaches (0, 0), the threshold becomes larger, which causes frequent non-alert events to occur in both observations and forecasts. Thus, if the curve is located on the upper left and the AUC is close to 1, the model shows a better performance.
where F conc. and O conc. denote forecasted and observed PM 2.5 mass concentrations, respectively. N is the number of total samples, F and O with low, moderate, high, and very high subscripts respectively denote forecasted and observed PM 2.5 grades, alert and non-alert subscripts are binary False alarm rate(FAR) = F high and very high O low and moderate F high and very high , Area under the ROC curve(AUC) = ∫

Evaluation of AQ Forecasts
The period of training, validation, and testing (i.e., evaluation) of the LSTM model should be chosen from among the seven years in which observations, CMAQ, and WRF simulations were prepared. The atmospheric PM 2.5 concentrations reduced substantially in the two years of 2020 and 2021 due to the COVID-19 pandemic where the CMAQ and LSTM forecasts generally overestimated PM concentrations (not shown). For this reason, 2019, the pre-pandemic period, was taken as the evaluation period, and the remaining period including 2020-2021 was used for the training and validation period. Although the inclusion of these two years in the training period strengthened the imbalance in the proportion for each grade, the overall performance was slightly improved in the LSTM forecast performance (not shown). Figure 4 presents the errors in PM 2.5 concentrations for Day+1 LSTM forecasts for the six broad regions. For convenience, the model errors were calculated and displayed at 10 μg m −3 intervals. Of the 365 days in 2019, the proportion of the days in each bin is shown in the parentheses beneath the x-axis. Depending on the region, the proportion of two bins < 25 μg m −3 is 65-83%, and that of 25-35 μg m −3 is 12-18%; thus, the summed proportion in these three bins occupies over 83% of the entire year. The high and very high grades (bins ≥ 35 μg m −3 ) were most frequent (17%) in SMA + Gangwon West and Chungcheong-do ( Fig. 4a and b) while the other four regions showed much smaller frequencies (< 12%) with the smallest in Jeju-do ( Fig. 4c-f). The model forecast biases were positive/negative values for small/large PM 2.5 concentrations below/above 25 μg m −3 . The positive biases did not exceed 4 μg m −3 in the median value, but the negative ones were much larger, with a median value of −6 to −34 μg m −3 in three bins corresponding to the high and very high grades. It is supposed that the LSTM model underestimates high PM 2.5 (≥ 35 μg m −3 ) concentrations. Despite these biases, the forecasted annual mean concentration agrees reasonably well with the observations.

Evaluation of the LSTM Forecasts
The errors in Day+2 forecasts were similar to those of Day+1, except for larger absolute values (not shown). Note that the numerical weather prediction model errors increased with advancing forecast lead-time. In the LSTM forecasts, the effects of the model data were dominant over those of the observation data and contributed more to Day+2 than to Day+1 (Kim et al. 2022). The increase of errors in the WRF-CMAQ forecast with increasing forecast lead time amplified the LSTM model errors for longer the forecast lead times. Figure 5 shows the ROC curve of the Day+1 forecasts for the six broad regions. The thresholds for the four points were the same as those in Fig. 4, except for 55 μg m −3 . For the 15 μg m −3 threshold (O), the alert cases included the moderate, high, and very high grades, and the non-alert cases included the low grade. For the six broad regions, TPR ranged from 0.84 to 0.97, close to 1, but FPR was only 0.23-0.42. The increase in FPR compared to TPR in the range of FPR ≥ 0.5 demonstrates that the PM 2.5 concentrations were overestimated in the range < 15 μg m −3 as shown in Fig. 4. The 25 μg m −3 threshold (×) corresponded to the median of the moderate grade. At this reference point, the FPR and TPR values for four broad regions (SMA + Gangwon West, Chungcheong-do, Gyeongsang-do, and Jeolla-do) were 0.13-0.15 and 0.80-0.83, respectively, indicating satisfactory classification ability (Fig. 5a-d). At the 35 μg m −3 threshold (Δ), the FPR had a value close to 0 with little regional deviation. However, the TPR values were 0.26-0.63, 0.17-0.30 lower than those in the 25 μg m −3 threshold, showing a large regional discrepancy. In particular, in Gangwon East and Jeju-do, where PM concentrations were relatively low, the TPR was approximately 0.3, making the model vulnerable to more frequent high-and very high-grade alarms (Fig. 5e-f). For a threshold of ≥ 45 μg m −3 (+), the TPR decreased rapidly owing to the large negative bias in the LSTM model results (see Fig. 4). A few outliers were found in Gangwon East in the threshold range of 43-63 μg m −3 (Fig. 5e), though there were only four such cases (1.2%). The AUC of Day+1 was 0.87-0.93, indicating that the LSTM model exhibited good forecast skill. Although not shown in the figure, the AUC of Day+2 was approximately the same as that of Day+1 in all broad regions.
The contributions of the input values to the LSTM model varied for each month, and the model performance varied from month to month. Table 2 illustrates the three evaluation parameters (accuracy, POD, and FAR) for Days +1 and +2 in the six broad regions for each season in terms of the four grades, not PM 2.5 concentrations. The accuracy values remained generally similar across all six broad regions and were slightly larger (but statistically meaningless) in summer and fall than in winter and spring. The accuracy was 68-85% on Day+1 and 2-15% lower on Day+2. Second, PODs on Day+1 showed larger values compared to those of the annual (60%) and winter means (70%) in relatively more polluted regions (SMA + Gangwon West and Chungcheongdo). The values were 4-16% smaller on Day+2. In contrast, PODs were below the annual mean of 39% in relatively more pristine regions (Gangwon East and Jeju-do), and below 56% in winter.
As FAR indicates the degree to which both high-and very high-grade forecasts fail, this value may increase with POD. Contrary to this expectation, however, the LSTM showed low FAR values for more polluted regions and seasons. The FARs were below 17% in regions SMA + Gangwon West and Chungcheong-do during winter, demonstrating LSTM's excellent performance in forecasting both high and very high grades. The FAR values were somewhat greater in spring than in winter, and were even larger in summer and fall. In addition, the values were generally higher on Day+2 than on Day+1.

Inter-Comparison of the Forecast Skill of the Three Models
The performance of the CMAQ-solely, LSTM, and Air-Korea forecasts for up to two days was evaluated for the 19 districts. Figure 6 shows the three forecast parameters and the RMSE for the forecast days and districts. The values on Day+1's and Day+2's are marked by colored boxes and black dots, respectively. It is apparent that the CMAQ-solely forecast exhibited low accuracy (~70%) with high POD and FAR values compared to the other two forecasts (Fig. 6a-c). Both high POD (31-98%) and FAR (51-86%) indicate that the CMAQ forecasts tended to overestimate PM 2.5 concentrations ( Fig. 6b and c; Ho et al. 2021). The national average values of accuracy and FAR on Day+2 were similar to those on Day+1, but the POD was 6% lower. The RMSEs of  (Fig. 6d). Overall, the performance of the CMAQ-solely forecasts was demonstrated to be below optimal for operational purposes. The LSTM and AirKorea forecasts showed similar forecast skill, too close to tell which one is better for operational purposes. For both LSTM and AirKorea, the accuracy was 68-80%, POD was 27-81%, and FAR was 12-69% ( Fig. 6a-c). LSTM showed a higher accuracy (~1.6%) than AirKorea for nine districts while AirKorea demonstrated higher accuracy for the remaining 10 districts (Fig. 6a). For most districts, the differences between the accuracies of Day+1 and Day+2 were less than 6%. The PODs of the Air-Korea forecasts were approximately 8% above those of the LSTM forecasts, and 25-31% higher for JLS and JJD than the LSTM forecasts (Fig. 6b). The LSTM forecasts yielded moderately smaller POD values (< 39%) for the GWE and JJD, where the CMAQ-solely forecasts produced noticeably low POD, 50% for GWE and 31% for JJD. The deterioration of POD (up to −30%) Day+2 to Day+1 was greater than that of the accuracy. The AirKorea forecasts showed larger FAR than the LSTM forecasts, similarly as the POD from Day+1 and Day+2 (Fig. 6c).
The AirKorea forecasts were produced subjectively by the NIER forecasters based on the CMAQ forecasts and weather patterns, and given the analogous forecast skill of LSTM to that of AirKorea, it suggests that the forecast skill of LSTM is at the same level as that of trained human forecasters. The LSTM forecasts yielded an RMSE of 5-9 μg m −3 on Day+1, about half of that in CMAQ forecasts (Fig. 6d). This value increased on Day+2 by 1.7 μg m −3 from Day+1. Figure 7 shows the accuracy of the LSTM and Air-Korea forecasts across the four AQ grades in the six broad regions. In Fig. 6, the colored boxes and black dots denote the values on Day+1's and Day+2's, respectively. For the low and moderate grades, the two forecasts ('both' in the figure) demonstrated accuracies of 48-78% across the broad regions. The hit rate for either the LSTM-solely' or the 'AirKorea-solely' forecasts was ≤ 23%. The FAR in both ('neither' in the figure) was below 21%; the values for the moderate grade were even smaller. No significant differences in skill between Day+1's and Day+2's were found.
For the high and very high grades, the accuracy varied notably according to regions and forecasts, where 'successes' became smaller (mostly < 49%), and 'no successes' became larger (0-60%). This implies that more efforts are needed to improve forecasting high PM 2.5 events. When only one of the two forecasts was considered, the AirKoreasolely forecasts were better than the LSTM-solely forecasts for the very high grade, with success ratios of 38% for Chungcheong-do, 75% for Gyeongsang-do, and 100% for Gangwon East (Fig. 7b, c, and e, respectively). For Jeju-do, a very high grade event never occurred as correctly forecasted by both forecasts (Fig. 7f). Table 3 compares the accuracy averaged over the 19 districts according to the grades and seasons for the LSTM and AirKorea forecasts. The accuracy was highest for the moderate grade in all seasons and for forecast lead times (i.e., Days +1 and +2). Both systems simultaneously failed (i.e., 'neither') for up to 10%. For the moderate grade, the percentages where both forecasts were accurate (i.e., 'both') were 75-78% for Day+1 and 53-58% for Day+2 except for fall. Accuracy of the LSTM forecasts (i.e., combining both and LSTM-solely) was 85-93% on Day+1 and 84-92% on For the low grade, 'both' decreased and 'neither' increased significantly. This change is obvious for Day+2, where they reached 6-38% and 14-42%, respectively. For Day+1, AirKorea-solely forecasts showed higher skill than the LSTM-solely forecasts; the opposite was true for Day+2. For the high and very high grades, both forecasts exhibited smaller values than those for the low grade. For the two highest grades, most forecasts of the two systems did not exceed 50% accuracy on Day+1. The accuracy ratio further worsened (up to 86% for neither) on Day+2.

Conclusion
This study evaluated two-day PM 2.5 forecasts for Korea using three forecast systems, CMAQ-solely, CMAQ-LSTM, and AirKorea over the seven-year period 2015-21 with the year 2019 as the testing (or evaluation) period; the remaining years were used as the training and validation periods. Including the anomalous years of 2020-21 due to the COVID19 pandemic in the evaluation period yielded essentially identical results as presented in this study (not shown). The results for the six broad regions showed median LSTM forecast error and observation error values to be −4.6 and 4 μg m −3 for the low and moderate grades, and −34.1 to −4 μg m −3 for the high and very high grades. In the ROC curve that represents the classification performance of alert and non-alert divided by an arbitrary PM 2.5 thresholds, the LSTM model yielded the best skill for the moderate grade with near-zero bias. For the threshold values of the lowgrade, those of the TPR were nearly one, and the FPR was larger than 0.4. In contrast, for the high and very high grades, FAR was nearly zero, and the TPR decreased to below 0.6. The overall performance of the LSTM model was acceptable, with AUC over 0.87 and 0.84 in Day+1 and Day+2, respectively.
Averaged over the all broad forecast regions with the grade-based evaluation of the LSTM forecasts, the accuracy, POD, and FAR values on Day+1 (Day+2) were 76% (72%), 60% (50%), and 26% (29%), respectively. While the accuracy remained similar for all broad regions and seasons, POD exhibited high values in polluted regions (SMA + Gangwon west and Chungcheongdo) and during winter, and FAR showed high values in pristine regions (Gangwon East and Jeju-do) and during all seasons except for in winter. Furthermore, the LSTM forecasts were compared against the CMAQ-solely and Air-Korea forecasts. The CMAQ forecasts underperformed the other two forecasts, with low accuracy and high FAR due to overestimation. The average RMSE of the CMAQ forecasts for Day+1 was 12.8 μg m −3 , which is 1.8 times that of the LSTM model. AirKorea, on the other hand, showed similar accuracy as LSTM, yielding 8% higher POD and 6% higher FAR compared to those from the LSTM on Day+1. Both models showed high accuracy for low and moderate grades across all the broad regions. However, for high and very high grades, the hit ratios decreased to 49% and the failure ratio increased to 60% along with large regional deviations.
Using the LSTM-based AI model in conjunction with CMAQ-based AQ forecasts has yielded a performance level similar to experienced human forecasters who used subjective interpretation of observations and/or numerical model predictions. However, this does not warrant that the AI model can replace the human forecasters. The AI model has inherent limitations: a black box for which the decision-making processes in producing forecasts were not interpreted at all. It is also a major weakness that the cause of incorrect forecasts yielded by the AI is not explained. Although research on explainable AI actions has been actively conducted to solve this problem, it is currently insufficient. Moreover, the LSTM model was highly dependent on inputs from the numerical model forecasts. Because an advanced numerical model makes the AI model more complete, the development of numerical models as well as AI models is necessary. Therefore, current LSTM forecasts can only be recommended as additional references for human forecasters, something that the NIER has already begun to do. districts. Numbers are in the order of both/LSTM-solely/ AirKorea-solely/neither.denotes non-value. The meaning of 'both', 'LSTM-solely', 'AirKorea-solely', and 'neither' is the same as Fig. 7